GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R
Stefan Jünger & Dennis Abel
2025-04-10
Now
Day
Time
Title
April 09
10:00-11:30
Introduction
April 09
11:30-11:45
Coffee Break
April 09
11:45-13:00
Data Formats
April 09
13:00-14:00
Lunch Break
April 09
14:00-15:30
Mapping I
April 09
15:30-15:45
Coffee Break
April 09
15:45-17:00
Spatial Wrangling
April 10
09:00-10:30
Mapping II
April 10
10:30-10:45
Coffee Break
April 10
10:45-12:00
Applied Spatial Linking
April 10
12:00-13:00
Lunch Break
April 10
13:00-14:30
Spatial Autocorrelation
April 10
14:30-14:45
Coffee Break
April 10
14:45-16:00
Spatial Econometrics & Outlook
What are georeferenced data?
Data with a direct spatial reference \(\rightarrow\)geo-coordinates
Information about geometries
Optional: Content in relation to the geometries
Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), and the Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019
Georeferenced survey data
Survey data enriched with geo-coordinates (or other direct spatial references).
With georeferenced survey data, we can analyze interactions between individual behaviors and attitudes and the environment.
An example workflow
:::: ::: {.column width=“50%”} From the addresses to analyses with georeferenced survey data, several steps and challenges along the way. We will talk about:
Data Protection & Data Access
Geocoding
Spatial Data Linking
Applied Examples :::
::::
Data protection
That‘s one of the biggest issues.
Explicit spatial references increase the risk of re-identifying anonymized survey respondents
Can occur during the processing of data but also during the analysis
Affects all phases of research and data management!
Data availability
Geospatial Data
Often de-centralized distributed
Fragmented data landscape, at least in Germany
Georeferenced Survey Data
Primarily, survey data
Depends on documentation
Access difficult due to data protection restrictions
:::: ::: {.column width=“50%”} In Germany, storing personal information such as addresses in the same place as actual survey attributes is usually not allowed.
Projects keep them in separate locations
Can only be matched with a correspondence table
Necessary to conduct data linking :::
Jünger, 2019
::::
Geocoding
Geocoding is the conversion of indirect spatial references (e.g., addresses) into direct spatial references (e.g., coordinates)
However, conducting this procedure is tricky (not only in R). Many services are either
Expensive (at least they cost money or have other restrictions)
Probably not data protection-friendly (Hey Google)
Or both
OSM Is Your Friend
We can use the Nominatim API from OSM to geocode at least a few addresses.
:::: ::: {.column width=“50%”} The geocoding tool automatically retrieves point coordinates, administrative unit keys, and grid cell IDs. Spatial joins based on coordinates for other units:
Constituencies
Administrative units across time (e.g., harmonized territorial status) :::
Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), Leibniz Institute of Ecological Urban and Regional Development (2018), Statistical Offices of the Federation and the Länder (2016), and German Environmental Agency / EIONET Central Data Repository (2016) / Jünger, 2019
::::
Data Linking
Linking via common identifiers is most commonly used but comes with challenges (e.g., territorial status and land reforms? Comparable units? Heterogeneity within units?).
Spatial linking methods (examples) I
1:1
(sf::st_join())
Distances
(sf::st_distance())
Sources: German Environmental Agency / EIONET Central Data Repository (2016) and OpenStreetMap / GEOFABRIK (2018) / Jünger, 2019
Spatial linking methods (examples) II
Filter methods
(sf::st_filter() or terra::vect(..., filter = ...))
Buffer zones
(sf::st_buffer())
Sources: Leibniz Institute of Ecological Urban and Regional Development (2018) and Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019
Cheatsheet: Spatial Operations
An overview of spatial operations using the sf package can be accessed here.
Data Aggregation
If you want to aggregate attributes and geometries of a shapefile, you can rely on st_combine(x) , st_union(x,y) and st_intersection(x, y) to combine shapefiles, resolve borders and return the intersection of two shapefiles.
For raster data, you can aggregate with the function terra::aggregate()(if you have matching raster files) in combination with terra::resample() (if your raster files don’t match).
Say we’re interested in the impact of neighborhood characteristics (e.g., mobility infrastructure) on individual-level attitudes towards energy transition.
We plan to conduct a survey which is representative of the population of Germany.
https://imgflip.com/i/9ptcuu
Synthetic georeferenced survey data
We have added a synthetic dataset of 2,000 geocoordinates in the ./data/ folder (aggregated to 1 sq km centroids). Initially, it was based on a sample of the georeferenced GESIS Panel.
synthetic_survey_coordinates <-readRDS("./data/synthetic_survey_coordinates.rds")tmaptools::read_osm( synthetic_survey_coordinates, type ="esri-topo") |> terra::rast() |>tm_shape() +tm_rgb() +tm_shape(synthetic_survey_coordinates) +tm_dots(size =2, col ="red")
Correspondence table
As in any survey that deals with addresses, we need a correspondence table of the distinct identifiers.
We ‘ask’ respondents for some standard sociodemographics. But we also include an item from the GLES Panel on energy transformation: “From 2030, no more new cars with petrol or diesel engines are to be registered in Germany. How much do you agree?” (entrans). Since we cannot share the actual data, we created fake data using the faux package.
Better access to charging infrastructure means higher support for energy transformation.
Rural-urban divide
Higher population density means higher support for energy transformation.
District-level data
We already have most of our information on the district level from yesterday.
district_attributes <-# load district shapefile sf::read_sf("./data/VG250_KRS.shp") |># add attribute table dplyr::left_join( readr::read_delim("./data/attributes_districts.csv", delim =";"), by ="AGS" )
District operationalization
Access to charging infrastructure
Charging stations per 1000 inhabitants in a district
Rural-urban divide
Population Density in a district
Access to charging infrastructure
Luckily, we did something similar yesterday!
charger_data <-# Load charging station points readr::read_delim("./data/charging_points_ger.csv", delim =";") |># Filter out rows with missing longitude or latitude dplyr::filter(!is.na(longitude) &!is.na(latitude)) |># Convert data frame to sf object sf::st_as_sf(coords =c("longitude", "latitude"), crs =4326) |># Reproject the spatial data to the desired CRS (Coordinate Reference System) sf::st_transform(crs = sf::st_crs(district_attributes))aggregated_charger_data <- charger_data |># spatial join district ids sf::st_join(district_attributes |> dplyr::select(AGS)) |># Group by district ID dplyr::group_by(AGS) |># Summarize the number of chargers in each district dplyr::summarise(charger_count = dplyr::n()) |># Drop geometry column sf::st_drop_geometry()district_attributes <-# Left join with sampling area attributes dplyr::left_join( district_attributes, aggregated_charger_data, by ="AGS" ) |># Calculate charger density per 1000 population dplyr::mutate(charger_dens = (charger_count *1000) / population)
Rural-urban divide
Our attribute table contains the number of inhabitants per district but not the population density. Therefore, we need to calculate the area of the district.
# calculate area of districts# areas will always be calculated# in units according to the CRS sf::st_area(district_attributes) |>head(4)
# calculating population densitydistrict_attributes <- district_attributes |># calculate area of districts (areas will always# be calculated in units according to the CRS ) dplyr::mutate(area = sf::st_area(district_attributes) ) |># change unit to square kilometers dplyr::mutate(area_km2 = units::set_units(area, km^2) ) |># recode variable as numeric dplyr::mutate(area_km2 =as.numeric(area_km2) ) |># calculate population density dplyr::mutate(pop_dens = population/ area_km2 )tm_shape(district_attributes) +tm_fill("pop_dens", fill.scale =tm_scale(breaks =c(0,100,200,400,800,1600,3200, Inf) ) ) +tm_layout(legend.outside =TRUE)
Respondents in districts
We have population density on the district level. Since our analysis focuses on the individual level, we can spatially join the information to our fake respondents’ coordinates.
district_linked_df <- district_attributes |> sf::st_transform(sf::st_crs(synthetic_survey_coordinates)) |># keeping just the variables we want dplyr::select(charger_dens, publictransport_meandist, pop_dens) |># since we want to join district to# respondent defining coordinates first sf::st_join(x = synthetic_survey_coordinates,# district data secondy = _,# some points may lie on the border# choosing intersects join = sf::st_intersects ) |># drop our coordinates for data protection sf::st_drop_geometry()
We have our nice fake coordinates and know we also have variations in some districts (e.g., Cologne) concerning e-car mobility. Let’s try to operationalize the variables on a smaller level of aggregation.
Access to charging infrastructure > Charging stations in a 5000m buffer
Rural-urban divide > Population in a 5000m buffer
Charging stations in 5000m buffer
The procedure for calculating the number of chargers in a 5km buffer is very similar to calculating the chargers in a district.
# Create 5000m buffers around the fake coordinatesbuffers <- synthetic_survey_coordinates |> sf::st_buffer(dist =5000)# Perform intersection between buffers and points_sfinter <- sf::st_intersects(buffers, charger_data |> sf::st_transform(3035))# Count points within each buffercoordinate_linked_df <- synthetic_survey_coordinates |> dplyr::mutate(num_charger =lengths(inter))
Distance calculation II
sf::st_distance() will calculate between all respondents and all train stations resulting in a huge matrix. We can make our lives easier by first identifying the nearest station and then calculating the distance.
# Find the nearest charging station nearest_charger <- sf::st_nearest_feature( synthetic_survey_coordinates, charger_data |> sf::st_transform(3035) )# Calculate the distance between each point in# fake_coordinates & its nearest charging stationcoordinate_linked_df <- coordinate_linked_df |> dplyr::mutate(charger_distances = sf::st_distance( synthetic_survey_coordinates, charger_data[nearest_charger,] |> sf::st_transform(3035), by_element =TRUE ) |>as.vector() )
Distance calculation II
# add a column for the distancescoordinate_linked_df <- coordinate_linked_df |> dplyr::mutate(# Calculate distances in kilometers charg_dist_km =as.numeric(charger_distances) /1000) summary(coordinate_linked_df$charg_dist_km)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01544 0.36216 0.72309 1.39690 1.80339 13.65649
Population buffers
…and we’re not yet done: we still need the population in the neighborhood. Let’s calculate buffers of 5000 meters and add the population mean values to our dataset.
# download data & extract informationinhabitants <- z11::z11_get_1km_attribute(Einwohner)# spatially link "on the fly"population_buffers <- terra::extract( inhabitants, synthetic_survey_coordinates |> sf::st_buffer(5000), fun = mean,na.rm =TRUE,ID =FALSE,raw =TRUE ) |>unlist()# link with data coordinate_linked_df <- coordinate_linked_df |> dplyr::mutate(population_buffer = population_buffers)
Join with Survey
We hope you’re not tired of joining data tables. Since we care a bit more about data protection, we have yet another joining task: to join the information we received using our (protected) fake coordinates to the actual survey data via the correspondence table.
# last joins for nowfake_survey_data_spatial <-# first join the id dplyr::left_join(correspondence_table, district_linked_df, by ="id") |> dplyr::left_join(coordinate_linked_df, by ="id") |># join the survey data dplyr::left_join(fake_survey_data, by ="id") |> dplyr::select(-id)